Red Wine Quality Exploration by Yiyi Tang

This dataset describes red wine quality. It contains 1599 observations (wines) of
12 variables: 11 chemical properties and the variable ‘quality’, a score based on
sensory data that ranges from 0 (very bad) to 10 (very excellent).

Univariate Plots Section

## Length  Class   Mode 
##      0   NULL   NULL
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   3.000   5.000   6.000   5.636   6.000   8.000
## 'data.frame':    1599 obs. of  14 variables:
##  $ X                   : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ fixed.acidity       : num  7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
##  $ volatile.acidity    : num  0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
##  $ citric.acid         : num  0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
##  $ residual.sugar      : num  1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
##  $ chlorides           : num  0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
##  $ free.sulfur.dioxide : num  11 25 15 17 11 13 15 15 9 17 ...
##  $ total.sulfur.dioxide: num  34 67 54 60 34 40 59 21 18 102 ...
##  $ density             : num  0.998 0.997 0.997 0.998 0.998 ...
##  $ pH                  : num  3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
##  $ sulphates           : num  0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
##  $ alcohol             : num  9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
##  $ quality             : int  5 5 5 6 5 5 5 7 7 5 ...
##  $ quality_f           : Factor w/ 6 levels "3","4","5","6",..: 3 3 3 4 3 3 3 5 5 3 ...
##  [1] "X"                    "fixed.acidity"        "volatile.acidity"    
##  [4] "citric.acid"          "residual.sugar"       "chlorides"           
##  [7] "free.sulfur.dioxide"  "total.sulfur.dioxide" "density"             
## [10] "pH"                   "sulphates"            "alcohol"             
## [13] "quality"              "quality_f"
##        X          fixed.acidity   volatile.acidity  citric.acid   
##  Min.   :   1.0   Min.   : 4.60   Min.   :0.1200   Min.   :0.000  
##  1st Qu.: 400.5   1st Qu.: 7.10   1st Qu.:0.3900   1st Qu.:0.090  
##  Median : 800.0   Median : 7.90   Median :0.5200   Median :0.260  
##  Mean   : 800.0   Mean   : 8.32   Mean   :0.5278   Mean   :0.271  
##  3rd Qu.:1199.5   3rd Qu.: 9.20   3rd Qu.:0.6400   3rd Qu.:0.420  
##  Max.   :1599.0   Max.   :15.90   Max.   :1.5800   Max.   :1.000  
##  residual.sugar     chlorides       free.sulfur.dioxide
##  Min.   : 0.900   Min.   :0.01200   Min.   : 1.00      
##  1st Qu.: 1.900   1st Qu.:0.07000   1st Qu.: 7.00      
##  Median : 2.200   Median :0.07900   Median :14.00      
##  Mean   : 2.539   Mean   :0.08747   Mean   :15.87      
##  3rd Qu.: 2.600   3rd Qu.:0.09000   3rd Qu.:21.00      
##  Max.   :15.500   Max.   :0.61100   Max.   :72.00      
##  total.sulfur.dioxide    density             pH          sulphates     
##  Min.   :  6.00       Min.   :0.9901   Min.   :2.740   Min.   :0.3300  
##  1st Qu.: 22.00       1st Qu.:0.9956   1st Qu.:3.210   1st Qu.:0.5500  
##  Median : 38.00       Median :0.9968   Median :3.310   Median :0.6200  
##  Mean   : 46.47       Mean   :0.9967   Mean   :3.311   Mean   :0.6581  
##  3rd Qu.: 62.00       3rd Qu.:0.9978   3rd Qu.:3.400   3rd Qu.:0.7300  
##  Max.   :289.00       Max.   :1.0037   Max.   :4.010   Max.   :2.0000  
##     alcohol         quality      quality_f
##  Min.   : 8.40   Min.   :3.000   3: 10    
##  1st Qu.: 9.50   1st Qu.:5.000   4: 53    
##  Median :10.20   Median :6.000   5:681    
##  Mean   :10.42   Mean   :5.636   6:638    
##  3rd Qu.:11.10   3rd Qu.:6.000   7:199    
##  Max.   :14.90   Max.   :8.000   8: 18
## 
##   3   4   5   6   7   8 
##  10  53 681 638 199  18

All variables are numeric except X (a row index) and quality, which are integers. I
created a variable named ‘quality_f’ that stores quality as a factor.
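
The factor conversion can be sketched as follows. This is a minimal illustration on a made-up data frame (`wine_toy` is a hypothetical stand-in, not the real dataset), and it may differ from the code actually used in this report.

```r
# Toy stand-in for the wine data; the values are invented for illustration.
wine_toy <- data.frame(quality = c(5L, 5L, 6L, 7L, 3L, 6L, 5L, 8L))

# Keep the integer column for numeric summaries and add a factor copy
# that is convenient for grouping and plotting.
wine_toy$quality_f <- factor(wine_toy$quality)

levels(wine_toy$quality_f)  # distinct scores present in the toy data
table(wine_toy$quality_f)   # number of wines per score
```

Keeping both columns means numeric summaries and correlations can still use `quality`, while grouped plots can use `quality_f`.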

In the dataset, ‘quality’ ranges from 3 to 8. The table above shows how many red
wines received each quality score. Most wines scored 5 or 6.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.40    9.50   10.20   10.42   11.10   14.90

The distribution of ‘alcohol’ peaks around 9.2 - 9.8. A few wines have extremely
high alcohol (14 or above, up to 14.9) or extremely low alcohol (below 9). Let’s
look at these outliers by quality score.

## wine$quality: 3
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   8.400   9.725   9.925   9.955  10.580  11.000 
## -------------------------------------------------------- 
## wine$quality: 4
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    9.00    9.60   10.00   10.27   11.00   13.10 
## -------------------------------------------------------- 
## wine$quality: 5
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     8.5     9.4     9.7     9.9    10.2    14.9 
## -------------------------------------------------------- 
## wine$quality: 6
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.40    9.80   10.50   10.63   11.30   14.00 
## -------------------------------------------------------- 
## wine$quality: 7
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    9.20   10.80   11.50   11.47   12.10   14.00 
## -------------------------------------------------------- 
## wine$quality: 8
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    9.80   11.32   12.15   12.09   12.88   14.00

The low-alcohol outliers (below 9) appear in quality categories 3, 5 and 6 (the
minimum in category 4 is exactly 9.00), while the high-alcohol wines (14 or above)
appear in categories 5, 6, 7 and 8. Only category 5 contains a wine above 14
(14.9).

## # A tibble: 6 x 4
##   quality alco_mean alco_median     n
##     <int>     <dbl>       <dbl> <int>
## 1       3  9.955000       9.925    10
## 2       4 10.265094      10.000    53
## 3       5  9.899706       9.700   681
## 4       6 10.629519      10.500   638
## 5       7 11.465913      11.500   199
## 6       8 12.094444      12.150    18

I built a summary table, ‘wine.alco_by_quality’, describing alcohol grouped by
quality. The best quality category (8) has the highest mean (12.09) and the
highest median (12.15) alcohol.
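
A grouped summary like this can be sketched in base R. The example below uses invented values (`wine_toy` and `alco_by_quality` are hypothetical names), and the report itself may have used a different approach such as dplyr.

```r
# Toy stand-in for the wine data; the values are invented for illustration.
wine_toy <- data.frame(
  quality = c(5L, 5L, 6L, 6L, 7L, 7L),
  alcohol = c(9.4, 9.8, 10.5, 11.0, 11.5, 12.5)
)

# tapply() applies a function to alcohol within each quality group;
# groups come back in sorted order of quality.
alco_by_quality <- data.frame(
  quality     = sort(unique(wine_toy$quality)),
  alco_mean   = as.vector(tapply(wine_toy$alcohol, wine_toy$quality, mean)),
  alco_median = as.vector(tapply(wine_toy$alcohol, wine_toy$quality, median)),
  n           = as.vector(tapply(wine_toy$alcohol, wine_toy$quality, length))
)

alco_by_quality
```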

Having looked at the mean and median alcohol in each quality category, I’m
curious whether alcohol influences the quality of wine, and whether other
variables act together with alcohol to influence quality.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   0.090   0.260   0.271   0.420   1.000

The distribution of citric.acid is slightly skewed to the right, with a high peak
at 0.00. This is not unusual, because citric acid is often found only in small
quantities in wine.

There are three other, relatively small peaks in the distribution. I also noticed
an outlier at 1.00. Because citric acid can add ‘freshness’ and flavor to wines,
I’m wondering whether higher citric.acid positively influences wine quality, and
whether the wine with citric.acid equal to 1 is of better quality.

##       X fixed.acidity volatile.acidity citric.acid residual.sugar
## 152 152           9.2             0.52           1            3.4
##     chlorides free.sulfur.dioxide total.sulfur.dioxide density   pH
## 152      0.61                  32                   69  0.9996 2.74
##     sulphates alcohol quality quality_f
## 152         2     9.4       4         4

Surprisingly, the wine with the maximum citric.acid has quality 4, which does not
count as a better quality.
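
One way to pull out the row with the largest value of a variable is `which.max()`. The sketch below uses an invented data frame (`wine_toy` is hypothetical) rather than the real data.

```r
# Toy stand-in for the wine data; the values are invented for illustration.
wine_toy <- data.frame(
  citric.acid = c(0.00, 0.26, 1.00, 0.42),
  quality     = c(5L, 6L, 4L, 7L)
)

# which.max() returns the index of the first maximum, so this selects
# the full row of the wine with the highest citric.acid.
top_citric <- wine_toy[which.max(wine_toy$citric.acid), ]
top_citric
```

If several wines shared the maximum, `subset(wine_toy, citric.acid == max(citric.acid))` would return all of them instead of only the first.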

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    4.60    7.10    7.90    8.32    9.20   15.90

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1200  0.3900  0.5200  0.5278  0.6400  1.5800

The two histograms above show the distributions of ‘fixed.acidity’ (acids that do
not evaporate readily) and ‘volatile.acidity’ (the amount of acetic acid in wine,
which at too high a level can lead to an unpleasant, vinegar taste).

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.900   1.900   2.200   2.539   2.600  15.500

The residual.sugar distribution is skewed to the right, with some outliers above
11. Most values of residual.sugar are between 1 and 3.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.01200 0.07000 0.07900 0.08747 0.09000 0.61100

Most of the chlorides are between 0.05 and 0.12.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.9901  0.9956  0.9968  0.9967  0.9978  1.0040

About 75% of wines have a density of 0.9978 or lower. The median density (0.9968)
and the mean density (0.9967) are very close.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.740   3.210   3.310   3.311   3.400   4.010

pH is fairly normally distributed with a few outliers. The mean pH is 3.311, and
about 75% of wines have a pH of 3.40 or lower.
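
The "3rd Qu." entry in a summary is the value at or below which about three quarters of the observations fall. A small sketch on invented pH values (`ph_toy` is hypothetical) makes that reading concrete.

```r
# Invented pH values for illustration only.
ph_toy <- c(3.0, 3.1, 3.2, 3.3, 3.4, 3.5, 3.6, 3.7, 3.8)

# Third quartile, as reported under "3rd Qu." by summary().
q3 <- quantile(ph_toy, probs = 0.75)

# Share of observations at or below the third quartile:
# at least 75% by construction.
share_at_or_below <- mean(ph_toy <= q3)
share_at_or_below
```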

## quality_bucket
##     Low (Rating 3 - 4) Medium (Rating 5 - 6)     High (Rating 7 - 8) 
##                     63                   1319                    217

I created quality_bucket to group the quality ratings. Wines scoring 3 or 4 are
in the “Low” bucket, wines scoring 5 or 6 are in the “Medium” bucket, and wines
scoring 7 or 8 are in the “High” bucket.
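
A bucketing like this can be sketched with `cut()`. The break points below are an assumption chosen to reproduce the three groups described above, and the scores are invented.

```r
# Invented quality scores for illustration only.
quality <- c(3L, 4L, 5L, 6L, 7L, 8L, 5L, 6L)

# With the default right = TRUE the intervals are (2,4], (4,6], (6,8],
# so scores 3-4, 5-6 and 7-8 land in separate buckets.
quality_bucket <- cut(
  quality,
  breaks = c(2, 4, 6, 8),
  labels = c("Low (Rating 3 - 4)", "Medium (Rating 5 - 6)", "High (Rating 7 - 8)")
)

table(quality_bucket)
```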

Univariate Analysis

What is the structure of your dataset?

There are 1599 wine observations in the dataset with 12 features
(fixed acidity, volatile acidity, citric acid, residual sugar, chlorides,
free sulfur dioxide, total sulfur dioxide, density, pH, sulphates, alcohol and
quality). The output variable quality is based on sensory data and is scored
between 0 and 10.

I treated the ‘quality’ variable as an ordered factor. Its possible levels are
shown below:

(very bad) --> (very excellent)

quality: 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10

In the dataset, however, quality only ranges from 3 to 8.

Other observations:

  • Most wines have a quality of 5 or 6.
  • The mean alcohol is 10.42%, and the median alcohol is 10.20%.
  • The min quality of wine in the dataset is 3, the max quality is 8,
    and the mean quality is 5.636.
  • About 75% of wines contain 2.6 g / dm^3 of residual.sugar or less.
  • The mean citric.acid is 0.271 g / dm^3, and the max citric.acid is 1 g / dm^3.

What is/are the main feature(s) of interest in your dataset?

The main features of interest in my dataset are quality, alcohol and
citric.acid. I’d like to know which feature, or combination of features, is best
for predicting the quality of wine.

I suspect that alcohol or citric.acid, together with some combination of the other
variables, influences the quality of wine. This hunch may help me build a
predictive model for wine quality later in the analysis.

What other features in the dataset do you think will help support your
investigation into your feature(s) of interest?

Features like density and pH will help support my investigation, because I
suspect alcohol might influence the density of the wine, and pH might be
influenced by alcohol and citric.acid.

Did you create any new variables from existing variables in the dataset?

I created the ‘quality_f’ variable as a factor for further bivariate analysis, and
a quality bucket grouping qualities into ‘low’, ‘medium’ and ‘high’. I also
created a summary table named ‘wine.alco_by_quality’ to see more clearly whether
alcohol and quality are correlated.

Of the features you investigated, were there any unusual distributions?
Did you perform any operations on the data to tidy, adjust, or change the form
of the data? If so, why did you do this?

  • I found some outliers in the alcohol variable (below 9 or above 14). I also
    noticed that the best quality category has the highest mean (12.09) and median
    (12.15) alcohol. That alone does not establish a linear relationship or
    correlation between alcohol and quality, so I analyze them further in the
    following section.

  • The citric.acid distribution has several peaks and is slightly skewed to the
    right. The highest peak is at 0.00, and there are three other, relatively small
    peaks. I also noticed an outlier at 1.00. I checked the wine with 1.00
    citric.acid and found that it has quality 4.

Bivariate Plots Section

From this matrix, I noticed that among my variables of interest (alcohol,
quality, pH, density and citric.acid), there are some potentially meaningful
correlations I would like to look at: quality and alcohol, alcohol and pH,
citric.acid and density, citric.acid and pH, and citric.acid and quality. In the
matrix these correlation values appear to be larger than 0.3 or smaller than -0.3,
which suggests a meaningful correlation.
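
The screening step described above can be sketched in base R with `cor()`. The data frame below is invented (`wine_toy` is hypothetical), and the 0.3 cutoff simply mirrors the threshold mentioned in the text. The report's own matrix was drawn with a plotting function rather than this code.

```r
# Toy stand-in for the wine data; the values are invented for illustration.
wine_toy <- data.frame(
  alcohol = c(9.4, 9.8, 10.5, 11.0, 11.5, 12.5, 9.5, 10.0),
  density = c(0.9978, 0.9968, 0.9962, 0.9955, 0.9950, 0.9940, 0.9970, 0.9966),
  pH      = c(3.51, 3.20, 3.26, 3.16, 3.40, 3.30, 3.39, 3.36)
)

# Pearson correlation matrix of all numeric columns.
cm <- cor(wine_toy)

# Keep each pair once (upper triangle) when |r| exceeds the cutoff.
idx <- which(abs(cm) > 0.3 & upper.tri(cm), arr.ind = TRUE)
strong <- data.frame(
  var1 = rownames(cm)[idx[, "row"]],
  var2 = colnames(cm)[idx[, "col"]],
  r    = cm[idx]
)

strong
```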

Relationship between alcohol and quality

I removed the alcohol outliers to see whether the relationship between alcohol and
quality would be stronger. It turned out only slightly stronger, so I kept all
observations and used Pearson’s correlation test on the full data. There may also
be more variables involved in this relationship.
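
Comparing the correlation with and without outliers can be sketched as below. The data and the 9 to 14 cutoffs are assumptions for illustration (`wine_toy` is hypothetical). In this toy example the effect of trimming is exaggerated, whereas in the real data the change was small.

```r
# Toy stand-in for the wine data; the values are invented for illustration.
wine_toy <- data.frame(
  alcohol = c(8.4, 9.4, 9.8, 10.5, 11.0, 11.5, 12.5, 14.9),
  quality = c(5L, 5L, 5L, 6L, 6L, 7L, 8L, 5L)
)

# Drop wines with alcohol below 9 or above 14 (assumed outlier cutoffs).
trimmed <- subset(wine_toy, alcohol >= 9 & alcohol <= 14)

# Pearson correlation on the full and on the trimmed data.
r_all     <- cor(wine_toy$alcohol, wine_toy$quality)
r_trimmed <- cor(trimmed$alcohol, trimmed$quality)

c(all = r_all, trimmed = r_trimmed)
```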

## 
##  Pearson's product-moment correlation
## 
## data:  wine$alcohol and wine$quality
## t = 21.639, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.4373540 0.5132081
## sample estimates:
##       cor 
## 0.4761663

The Pearson’s correlation result above shows a moderate correlation between
alcohol and quality. To be more specific, wines with higher alcohol tend to be
of better quality.

Relationship between citric.acid and density

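library(ggplot2)  # assumed to be loaded earlier in the report
# Scatter plot of density against citric.acid with a linear trend line
ggplot(aes(x = citric.acid, y = density), data = wine) +
  geom_point(alpha = 0.3, size = 1) +
  geom_smooth(method = lm, se = FALSE, size = 0.6)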

cor.test(wine$citric.acid,wine$density)
## 
##  Pearson's product-moment correlation
## 
## data:  wine$citric.acid and wine$density
## t = 15.665, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.3216809 0.4066925
## sample estimates:
##       cor 
## 0.3649472

There’s a small but meaningful positive correlation between citric.acid and density.

Relationship between alcohol and pH

## 
##  Pearson's product-moment correlation
## 
## data:  wine$alcohol and wine$pH
## t = 8.397, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.1582061 0.2521123
## sample estimates:
##       cor 
## 0.2056325

Alcohol and pH are only weakly correlated (about 0.21).

Relationship between alcohol and density

## 
##  Pearson's product-moment correlation
## 
## data:  wine$alcohol and wine$density
## t = -22.838, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.5322547 -0.4583061
## sample estimates:
##        cor 
## -0.4961798

There’s a moderate negative correlation between alcohol and density. To be
specific, wines with higher alcohol tend to have lower density.

Relationship between citric.acid and quality

## 
##  Pearson's product-moment correlation
## 
## data:  wine$citric.acid and wine$quality
## t = 9.2875, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.1793415 0.2723711
## sample estimates:
##       cor 
## 0.2263725

## wine$quality: 3
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0050  0.0350  0.1710  0.3275  0.6600 
## -------------------------------------------------------- 
## wine$quality: 4
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0300  0.0900  0.1742  0.2700  1.0000 
## -------------------------------------------------------- 
## wine$quality: 5
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0900  0.2300  0.2437  0.3600  0.7900 
## -------------------------------------------------------- 
## wine$quality: 6
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0900  0.2600  0.2738  0.4300  0.7800 
## -------------------------------------------------------- 
## wine$quality: 7
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.3050  0.4000  0.3752  0.4900  0.7600 
## -------------------------------------------------------- 
## wine$quality: 8
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0300  0.3025  0.4200  0.3911  0.5300  0.7200
## # A tibble: 6 x 3
##   quality citric_mean `n()`
##     <int>       <dbl> <int>
## 1       3   0.1710000    10
## 2       4   0.1741509    53
## 3       5   0.2436858   681
## 4       6   0.2738245   638
## 5       7   0.3751759   199
## 6       8   0.3911111    18

Better quality wines have a higher mean citric.acid.

Although citric acid can add ‘freshness’ or flavor to wine, the correlation
between quality and citric.acid is weak (about 0.23). Still, there is a tendency
for better quality wines to have a higher mean citric.acid.

Relationship between citric.acid and alcohol

## 
##  Pearson's product-moment correlation
## 
## data:  wine$citric.acid and wine$alcohol
## t = 4.4188, df = 1597, p-value = 1.059e-05
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.06121189 0.15807276
## sample estimates:
##       cor 
## 0.1099032

There is little correlation between citric.acid and alcohol (about 0.11).

Relationship between pH and density

## 
##  Pearson's product-moment correlation
## 
## data:  wine$pH and wine$density
## t = -14.53, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.3842835 -0.2976642
## sample estimates:
##        cor 
## -0.3416993

There’s a small but meaningful negative correlation between pH and density.

Relationship between pH and citric.acid

## 
##  Pearson's product-moment correlation
## 
## data:  wine$citric.acid and wine$pH
## t = -25.767, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.5756337 -0.5063336
## sample estimates:
##        cor 
## -0.5419041

pH and citric.acid have a moderate negative correlation around -0.5419.

Bivariate Analysis

Talk about some of the relationships you observed in this part of the
investigation. How did the feature(s) of interest vary with other features in
the dataset?

  • From the ggcorr correlation matrix, I found there might be some meaningful
    correlation between quality and alcohol, alcohol and pH, citric.acid and density,
    citric.acid and pH, citric.acid and quality.

  • I found that alcohol and quality have a moderate correlation: wines with
    higher alcohol tend to be of better quality. The correlation is around 0.476.

  • There is little correlation between quality and citric.acid. But I found that
    better quality wines have a higher mean citric.acid. For example, the mean citric.acid
    of quality 8 wine is 0.3911, while the mean citric.acid of quality 4 wine is 0.1742.

Did you observe any interesting relationships between the other features
(not the main feature(s) of interest)?

  • There’s a moderate correlation between alcohol and density. To be specific,
    wines with higher alcohol tend to have lower density. The correlation
    is around -0.496.

  • citric.acid and density have a meaningful but small correlation around 0.36.

  • pH and density have a meaningful but small correlation around -0.34. To be
    specific, when density increases, pH tends to decrease.

  • pH and citric.acid have a moderate negative correlation around -0.5419.

What was the strongest relationship you found?

pH and citric.acid have the strongest relationship among those I examined.

Multivariate Plots Section

Alcohol and density in quality category

It’s hard to read the plot with so many different colors, so I used
quality_bucket for better visualization.

All three quality groups seem to follow the same relationship between density and
alcohol.

Density and pH in quality category

The quality groups follow the relationship between pH and density. It is also
clear that the low quality group covers a narrower range of pH and density than
the medium and high quality groups.

Citric.acid and pH in quality category

The quality groups follow the relationship between pH and citric.acid. The low
quality group has a relatively wider range of citric.acid. I also noticed that
many medium quality wines have 0 citric.acid, compared to the low and high
quality groups.

Citric.acid and density in quality category

Calculate r-squared value

By calculating the r-squared value, I want to test whether alcohol, the variable
most strongly correlated with quality, yields an r-squared value strong enough to
support a linear relationship with quality.

# Nested linear models for quality, adding one predictor at a time
m1 <- lm(quality ~ alcohol, data = wine)
m2 <- lm(quality ~ alcohol + density, data = wine)
m3 <- lm(quality ~ alcohol + density + citric.acid, data = wine)
m4 <- lm(quality ~ alcohol + density + citric.acid + pH, data = wine)

summary(m1)$r.squared
## [1] 0.2267344
summary(m2)$r.squared
## [1] 0.2317266
summary(m3)$r.squared
## [1] 0.2576685
summary(m4)$r.squared
## [1] 0.2626409

I chose alcohol, which has the strongest correlation with quality among my
variables of interest, to test the linear relationship with quality.
Unfortunately, the r-squared value is not strong (0.22673).

But when I added each of the variables of interest into this model, the r-squared
value did improve from 0.22673 to 0.2626.

m5 <- lm(density ~ alcohol, data = wine)
summary(m5)$r.squared
## [1] 0.2461944

The r-squared value (about 0.246) is too weak to demonstrate a strong linear
relationship between alcohol and density.
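
For a regression with a single predictor, r-squared is exactly the square of the Pearson correlation, which is why 0.4761663 squared gives about 0.2267 for alcohol and quality, and -0.4961798 squared gives about 0.2462 for alcohol and density. Plain r-squared also cannot decrease when a predictor is added to an ordinary least squares model. The sketch below checks both facts on invented data (`wine_toy` is hypothetical). Showing adjusted r-squared alongside is a suggestion, not something the report computed.

```r
# Toy stand-in for the wine data; the values are invented for illustration.
wine_toy <- data.frame(
  alcohol = c(9.4, 9.8, 10.5, 11.0, 11.5, 12.5, 9.5, 10.0),
  density = c(0.9978, 0.9968, 0.9962, 0.9955, 0.9950, 0.9940, 0.9970, 0.9966),
  quality = c(5, 5, 6, 6, 7, 8, 5, 6)
)

m1 <- lm(quality ~ alcohol, data = wine_toy)
m2 <- lm(quality ~ alcohol + density, data = wine_toy)

# Single predictor: r-squared equals the squared Pearson correlation.
r <- cor(wine_toy$alcohol, wine_toy$quality)
all.equal(summary(m1)$r.squared, r^2)

# Adding a predictor never lowers plain r-squared.
summary(m2)$r.squared >= summary(m1)$r.squared

# Adjusted r-squared penalizes extra predictors, so it is a fairer
# basis for comparing nested models.
c(m1 = summary(m1)$adj.r.squared, m2 = summary(m2)$adj.r.squared)
```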

Multivariate Analysis

Talk about some of the relationships you observed in this part of the
investigation. Were there features that strengthened each other in terms of
looking at your feature(s) of interest?

  • From the bivariate analysis I found that density and alcohol have a moderate
    negative correlation. After adding the quality groups to the plot in the
    multivariate analysis, I found that the quality groups follow the relationship
    between density and alcohol.

  • I noticed that among my featured variables, alcohol has the strongest
    relationship with quality. So I calculated its r-squared value. Although the
    r-squared value between them is not strong (around 0.22673), it did improve
    from 0.22673 to 0.2626 when I added variables, such as
    density, citric.acid and pH, into the model.

Were there any interesting or surprising interactions between features?

  • Given the Pearson correlation value, I thought the r-squared value
    between alcohol and quality would be strong, at least bigger than 0.5. That
    hunch turned out to be wrong. It also surprised me that the r-squared
    value increased every time I added another featured variable to the model.

  • It also surprised me that the quality groups all follow the meaningful
    relationships I found in the bivariate analysis. To be specific, the quality
    groups follow the relationships between alcohol and density, density and pH,
    and pH and citric.acid.

OPTIONAL: Did you create any models with your dataset? Discuss the
strengths and limitations of your model.

I created a linear model. I used quality as the dependent variable and alcohol as
the independent variable. After I found that the r-squared value was not strong
enough, I added density, citric.acid and pH one at a time as independent
variables. The resulting r-squared value did improve, but it is still not strong.

A strength of this approach is that it shows the r-squared value after each new
featured variable is added, so it is easy to see how much each one contributes to
the linear fit. A limitation is that I did not include all the variables in the
dataset, so there may still be an important variable missing from the model.


Final Plots and Summary

Plot One

Description One

Alcohol and density have a moderate negative correlation around -0.496. Wines with
a higher alcohol percentage by volume tend to have lower density (g / cm^3), and
all wine quality groups follow this relationship between alcohol and density.

Plot Two

Description Two

Alcohol has the strongest correlation with quality, around 0.476. Wines with a
higher alcohol percentage by volume tend to be of better quality. However, wines
with a quality score of 5 fall a bit off this trend: their mean alcohol (9.90) is
lower than that of quality 3 and 4 wines. This may be because other variables,
acting together with alcohol to influence quality, were not discussed here.

Plot Three

Description Three

pH and citric acid (g / dm^3) have a moderate negative correlation around -0.5419.
Wines with higher citric acid (g / dm^3) tend to have lower pH, and all wine quality
groups follow this relationship between pH and citric acid. Also, the low quality
group tends to have a larger range of citric acid (g / dm^3) than the medium and
high quality groups.

Reflection

This Red Wine Quality dataset contains 1,599 observations of red wines. There are
12 variables in the dataset: 11 chemical properties of these wines and 1 output
variable, wine quality, which is graded by experts on a scale from 0 (very bad)
to 10 (very excellent).

I’m interested in exploring how these chemical properties influence the quality
of wine. Through univariate, bivariate, multivariate analysis and statistical
analysis, I tested different relationships between these variables.

Among the variables I examined, alcohol had the strongest correlation with wine
quality, around 0.476. Wines with a higher alcohol percentage by volume tend to be
of better quality. Unfortunately, the r-squared value for alcohol and quality is
not strong (around 0.22673). But when I added the other variables of interest to
the model one at a time, the r-squared value did improve from 0.22673 to 0.2626.

I think the limitations of this dataset are one of the major challenges. Among
the 1,599 observations, 82.5% of wines received a score of 5 or 6. Around 4% of
wines received a score of 3 or 4, and 13.6% received a score of 7 or 8. It would
be better to have a wider variety of quality scores in the dataset.
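
These shares can be recomputed from the per-score counts printed in the univariate section. The sketch below enters those counts by hand (`counts` and `bucket` are hypothetical names).

```r
# Wines per quality score, copied from the table in the univariate section.
counts <- c("3" = 10, "4" = 53, "5" = 681, "6" = 638, "7" = 199, "8" = 18)

# Collapse the scores into the three quality buckets.
bucket <- c(
  low    = sum(counts[c("3", "4")]),
  medium = sum(counts[c("5", "6")]),
  high   = sum(counts[c("7", "8")])
)

# Percentage of all wines in each bucket, rounded to one decimal place.
pct <- round(100 * bucket / sum(counts), 1)
pct
```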

For future analysis, it would be interesting and meaningful to combine or
compare this dataset with the white wine dataset, to see how the correlations
between these chemical properties and quality differ.
